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ABSTRACT 

Supercomputers  capable  of  performing  extremely  high 
speed  computation  nave  been  proposed,  wnicn  are  based  on  an 
architecture  known  as  lata  flow.  Application  of  a  Petri 
net-based  metnodology  is  used  to  evaluate  tfte  performance 
attainable  by  such  an  architecture.  The  architecture 
evaluated  is  MIT's  cell  bloci  data  flow  architecture  which 
is  being  developed  to  execute  the  applicative  programming 
language  7AL. 

Results  Show  that  for  the  data-  flow  architecture  to 
achieve  its  «oal  of  high  speed  computation,  intelligent 
multiprogramming  schemes  need  to  be  developed.  One  suca 
scheme,  based  on  tne  notion  of  a  "concurrency  vector",  is 
introduced. 
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I.  INTRODUCTION 

A.   BACK:GROaND 

Despite  tne  orders-of-magni tude  increase  in  computation 
speed  that  Has  occurrei  since  tne  early  1950'5,  tne  need 
still  exists  today  for  faster  computers.  Tnis  need  is  most 
critical  in  tne  area  of  scientific  computing,  wnere  tnere 
exist  computations  requirin^r  on  the  order  of  a  Million 
floating  point  operations  per  second  [DENNIS,  1980]. 

One  approach  to  achieving  higher  computation  speed  is  to 
increase  tne  speed  of  tne  basic  logic  devices  of  the 
coTiputer.  This  approach,  effective  in  the  past,  faces 
significant  obstacles  to  future  gains  because  of  the  speed 
of  liffht  limitation  to  signal  propaeation  and  limitations  in 
the  integrated  circuit  manufacturing  process. 

A  second  approach  to  achieving  higher  computation  speed 
is  through  the  exploitation  of  parallelism  which  is  (or  can 
be)  inherent  in  algorithms  used  to  solve  a  wide  range  of 
scientific  problems.  Such  parallelism  can  be  present  at  both 
the  operation  and  procedure  levels  in  a  program.  Thus  far, 
such  exploitation  of  parallelism  has  not  reached  a  limiting 
threshold  to  faster  computation. 

Data  flow  computing  has  been  proposed  as  a  conceptually 
viable  method  of  achieving  nlgher  computational  speed 
through   greater  exploitation   of   inherent   algorithmic 
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parallelisTi.  A  computer  based  on  tie  data  flow  concept 
executes  an  instruction  when  its  operands  become  available. 
No  sequential  control  flov  notion  exists.  Data  flow  programs 
are  free  of  sequencing  constraints  except  tnose  imposed  by 
tne  flow  of  operands  between  instructions.  TJius ,  a  data  flow 
coupuier  contrasts  fundamentally  witn  tne  "von  Neumann" 
model.  Even  so,  tne  data  flow  concept  is  capable  of 
incorporating  into  one  system  all  the  known  forms  of 
parallelism  exploitation  including  vectorization , 
pipelining,  and  multiprocessing. 

B.   RESEARCH  APPROACH 

It  was  tne  purpose  of  tnis  researcn  to  gain  insignts 
into  the  decree  of  parallelism  exploitation  obtainable  witn 
the  data  flow-based  hign  speed  computation  method.  The 
classical  issues  of  hardware  utilization,  program  execution 
time,  and  "deeree"  of  muliipro^rammine  were  investigated  in 
tne  context  of  data  flow.  Application  of  an  existing  Petri 
net-based  methodology  was  the  technique  used  to  ^ain  these 
insights. 

The  hypothesis  for  this  research  had  two  parts.  First, 
the  suitability  of  the  Petri  net-based  Requester-Server 
metnodology  for  prediction  of  tne  performance  of  data  flow 
machines  in  an  efficient,  accurate  manner  was  to  be 
explored.  Second,  a  cnallenge  to  tne  data  flow  concept  was 
made.  It  was  hypothesized  that  tne  poal  of  achieving  higher 


speed  coTiputation  tnrougn  data  flow  computing  Is 
unattainable  wittiout  acnieving  a  higa  and  "intelliffent" 
degree  of  multiprogramming.  By  "intelligent"  it  is  meant 
that  the  mapping  of  processes  onto  the  hardware  shall  have 
to  be  done  in  a  near  optimal  fashion,  defined  in  terms  of 
hardware  utilization  and  program  execution  time. 

To  explore  the  two-part  Hypothesis,  sets  of  Petri  net 
models  of  data  flow  programs,  characterized  by  a  range  of 
inherent  parallelism,  were  "executed"  on  Petri  net  models  of 
the  Dennis-Misunas  data  flow  Hardware  design  [DENNIS,  AUG 
1974],  usin?  the  methodology  called  Requester-Server  [COX, 
1978].  The  nardware  models  were  varied  in  the  number  of 
processing  elements  available  for  use  in  "executing"  the 
sets  of  program  models.  Thus  software  models  were  "run"  on 
hardware  models,  and  appropriate  performance  indices  were 
measured  and  analyzed. 

This  research  is  important  because  it  suggests  a  method 
for  mapping  data  flow  programs  onto  the  data  flow  machine  to 
achieve  tne  desired  degree  of  high  speed  computation. 
Additionally,  the  Requester-Server  (R-S)  methodology  has 
been  shown  to  be  an  effective  tool  for  predicting  the 
performance  of  data  flow  computer  architectures. 

Grateful  acknowledgment  is  made  to  L.  A.  Cox,  designer 
and  initial  implementer  of  tne  R-S  methodology,  and  to  D.  M. 
Stowers,  who  modified  the  R-S  software  to  enable  it  to  run 
on  tne  PDP-11/50  minicomputer  at  NPS. 
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C.   ORGANIZATION 

T&e  results  reported  nere  are  organized  in  a  lasnion 
conducive  to  the  communication  of  experimental  computer 
science  endeavors,  following  a  review  of  tne  applicable 
literature  (Section  II),  tne  hypotftesis  (Section  III)  is 
presented.  Next,  tne  method  used  to  test  the  hypothesis  is 
presented  in  detail  (Section  IV).  This  section  discusses  tne 
experimental  design  which  includes  the  identification  of 
independent  and  dependent  variables,  cnaracteri zes  and 
explains  the  Petri  net  definition  of  the  data  flow  hardware 
and  software,  and  ends  witn  an  account  of  the  procedure  used 
to  implement  the  experiment  to  test  the  hypothesis.  Results 
of  the  experiment  and  a  discussion  thereof  are  covered  in 
Section  7.  This  section,  in  addition  to  demonstrating  tne 
suitability  of  the  R-S  technique,  and  exploring  the 
multiprogranming  response  of  data  flow  architectures, 
presents  some  unexpected  findings.  Section  VI  summarizes  the 
entire  research  effort,  including  the  results.  Finally, 
Section  VII  presents  recommendations  for  furtner 
investigation  in  the  area  of  data  flow  researcn. 


II.  LITERATURE  REVIEW 

A.   APPROACHES  TO  PARALLELISM 

In  general,  computer  science  literature  approacnes  tne 
concept  of  parallelism  exploitation  from  eitner  an 
architecture  (aardware)  or  language  (software)  point  of 
view.  In  t&is  tnesis,  "parallelism"  snail  be  viewed  as 
existing  at  many  hierarchical  levels  within  algorithms.  Any 
of  several  different  computer  architectures  may  he  capable 
of  exploiting  the  parallelism  which  exists  at  one  or  more  of 
these  various  hierarchical  levels.  As  should  be  expected, 
each  architecture  is  best-suited  at  exploiting  inherent 
algorithmic  parallelism  at  a  particular  hierarchical  level, 
but  not  at  otners.  In  contrast,  implementation  of  tne  data 
flow  concept  proposes  to  exploit  inherent  alfforithmic 
parallelism  at  all  hierarchical  levels,  in  an  efficient 
fashion.  Before  presenting  tne  concept  of  data  flow,  a 
review  of  the  range  of  architectures  and  strate,eies 
currently  used  to  exploit  parallelism  is  presented. 

In  an  early  study  of  high  speed  computer  architectures 
[FLYNN,  1966J  a  four  element  taxonomy  was  developed  which 
classified  computer  systems  in  terms  of  the  amount  of 
parallelism  in  their  instruction  streams  and  data  streams 
(see  figure  II. A. 1).  (An  instruction  stream  is  the  series  of 
operations  used  by  the  processor;   a  data   stream   is   the 
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series  of  operands  used  by  tne  processor.)  Tne  first  eietient 
of  this  taxonomy,  depicted  in  figure  II. A. 1(a),  is  the 
serial  computer  wnich  executes  one  instruction  at  a  time, 
affecting  at  most  one  data  item  at  a  time.  Sucn  a  serial 
machine  is  denoted  as  a  single-instruction 
single-data-stream  (SISD)  computer.  Tne  SISD  computer  can  be 
characterized  as  possessing  no  capability  for  exploitation 
of  algorithmic  parallelism.  The  three  remaining  computer 
system  organizations  within  the  Flynn  taxonomy  do  possess 
capabilities  for  exploiting  algorithmic  parallelism. 

By  allowing  more  than  one  data  stream  a 
single-instruction  multiple-data-stream  computer  results,  as 
Shown  in  figure  II. A. 1(b).  This  organization  allows 
vectorlzation  and  is  icnown  as  a  vector  or  array  processor 
because  each  instruction  operates  on  a  data  vector  during 
each  instruction  cycle,  rather  than  on  Just  one  operand.  The 

model  in  figure  II. A. 1(b)  shows  N  processors  each  accepting 
as  input  its  own  data  stream.  It  is  noteworthy  that  eacn  of 
the  N  processors  is  not  a  standalone  serial  machine  (SISD 
co-nputer)  because  tne  N  processors  tate  tne  same  instruction 
from  an  external  control  unit  at  each  time  step. 

If  the  SISD  computer  is  extended  to  permit  more  than  one 
instruction  stream,  the  multiple-instruction 
single-data-stream  (MISD)  computer  snown  in  figure  II. A. 1(c) 
results.  This  computer  system  organization  within  Flynn's 
taxonomy  is  present  for  completeness  but  has  yet  to  be  shown 
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to  possess  mucJi  utility.  An  example  of  sucn  a  machine  would 
be  one  built  to  generate  tables  of  functions  (sucn  as 
squares  and  square  roots)  of  a  stream  of  numbers.  Each 
processor  would  perform  a  different  function  on  the  same 
data  item  at  each  time  step. 

The  fourth  and  final  element  of  Flynn's  taxonomy  is  one 
which  possesses  parallelism  in  both  the  instruction  and  data 
streams.  This  multiple-instruction  multiple-data-stream 
(MIMD)  computer  (shown  in  figure  II. A. 1(d))  is  made  up  of  N 
complete  SISD  machines  which  are  interconnected  for 
communication  purposes.  Such  a  parallel  architecture  is  more 
readily  recognized  as  a  multiprocessor  in  which  as  many  as  N 
processors  can  be  performing  useful  wort  at  the  same  time. 

Beyond  Flynn's  taxonomy  are  other  approaches  to 
parallelism.  The  first,  pipelining,  is  a  strategy  which 
mates  use  of  the  fact  that  a  processor,  in  executing  an 
instruction,  actually  performs  a  sequence  of  functions  in 
various  functional  units  of  the  processor.  Each  function  is 
performed  at  a  different  stage  alone  the  pipeline.  Figure 
II. A. 2  Shows  a  processor  with  a  simple  pipeline  design. 
Rather  than  waiting  for  each  instruction  to  be  completely 
executed  before  beginning  the  next  instruction,  the  pipeline 
processor  begins  execution  of  the  next  instruction  as  soon 
as  functional  units  at  the  befflnnine  of  the  pipeline  are 
available.  Thus,  the  pipeline  is  normally  full,  containing 
more  than  one  instructions  in  various  stages  of  execution. 
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Ttie  final  approach  to  parallelism  to  be  presented  is  tfte 
strateffy  of  overlapping.  In  the  traditional  sense, 
overlapping  within  a  computer  system  occurs  wJien  the  central 
processing  unit  (CPQ)  is  allowed  to  function  concurrently 
with  input/output  (I/O)  operations.  Such  concurrency  was 
prevented  in  early  computers  because  I/O  operations  required 
data  paths  to  memory  which  ran  through  CPU  registers, 
preventing  CPU  functions  from  occurring  while  performing 
I/O,  Overlapping  can  occur  in  otner  ways  witnin  a  computer 
but  the  example  given  is  sufficient  to  convey  the  general 
idea. 

The  techniques  for  exploiting  algorithmic  parallelism 
that  nave  been  presented  are  not  all  mutually  exclusive.  Jor 
example,  the  strategies  of  pipelining  and  overlapping  can  be 
included  in  any  of  the  four  architectures.  Furthermore, 
other  more  complex  machine  organizations  have  been  proposed. 
One  example  is  the  multiple  SIMD  (MSIMD)  macnine  which 
consists  of  more  than  one  control  units  snaring  a  pool  of 
processors  through  a  switching  networit  [HWANG,  1979]  .  Such 
hybrids  will  not  be  considered  further. 

Having  described  the  major  architectural  approaches  to 
exploiting  algorithmic  parallelism  it  is  appropriate  to 
characterize  the  problems  for  which  each  method  is  suitable 
and  to  present  some  of  the  difficulties  that  still  exist  in 
using  each  method.  By  "suitable"  it  is  meant  that  the  method 
allows  the  processing  of  a  problem  in   such  a  manner   that 
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soTie  speedup  In  execution  tlTie  is  achieved  in  comparison 
with  what  tae  execution  time  would  be  for  the  problem  run  on 
a  serial  macnine. 

The  main  Implementation  of  the  SIMD  architecture,  the 
array  processor,  is  suitable  for  computations  whicn  can  be 
described  by  vector  instructions.  Also,  operands  processed 
simultaneously  must  be  capable  of  being  fetched 
simultaneously  from  memory.  Finally,  processor 
interconnections  must  support  high  speed  data  routine 
between  processors.  If  any  of  tne  above  conditions  are  not 
met,  then  the  computation  may  execute  in  a  predominantly 
serial  fasnion  within  tnis  SIMD  computer.  Because  of  these 
required  conditions,  the  array  processor  is  generally 
considered  to  be  a  speciallized,  rather  than  general 
purpose,  machine. 

As  previously  mentioned,  the  MISD  architecture  exists 
merely  to  complete  tne  Flynn  taxonomy  and  will  not  be 
discussed  further  [STONE,  1980]. 

The  dominant  MIMD  computer  is  the  multiprocessor.  The 
multiprocessor  is  considered  to  be  a  general  purpose 
machine.  Accordingly,  many  problems  should  be  well-suited 
for  execution  on  such  an  arcnitecture.  Despite  the  fact  that 
such  systems  have  been  shown  to  worfc  well  in  a  number  of 
applications,  especially  those  wnicn  consist  of  a  number  of 
concurrently-processable  subproblems  with  minimal  data 
sharing,   numerous   questions  have  yet  to  be  answered.  These 


16 


questions  include  now  to  best  organize  tne  parallel 
computations  (i.  e.  partition  the  problem)  to  optimize  the 
use  of  tne  cooperating  processors,  now  to  synchronize  tne 
processors  in  the  system,  and  how  to  test  share  the  data 
among  system  processors.  Also,  problems  which  possess  an 
iterative  structure  can  run  efficiently  on  an  array 
processor  and  avoid  the  overhead  of  synchronization  and 
scheduling  required  of  the  multiprocessor  [STONE,  1960J . 

Although  not  considered  "architectures"  in  the  sense  of 
Flynn's  taxonomy,  both  pipelining  and  overlapping  (of  which 
there  exist  different  types)  are  general  purpose  strategies 
that  can  be  applied  to  most  problems.  Like  the 
multiprocessor,  these  techniques  also  permit  the 
partitioning  of  a  problem  so  that  several  operating  hardware 
pieces  can  function  concurrently.  Accordingly,  pipelining 
and  overlapping  are  often  considered  to  be  forms  of 
multiprocessing.  Tne  difference  lies  in  the  fact  that 
pipelining  and  overlapping  perform  partitioning  at  different 
hierarchical  levels  of  a  problem  than  does  the 
multiprocessing  technique. 

Armed  with  an  understanding  of  the  diverse  architectural 
approaches  to  parallelism  exploitation  that  nave  been  used 
to  date,  it  is  logical  to  proceed  with  an  alternative 
approach,  that  of  data  flow.  Before  doing  so,  however,  Petri 
nets  will  be  introduced.  This  is  appropriate  because  the 
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concepts  of  data  flow  computation  are  a  direct   application 
of  Petri  net  tneory. 

B.   PETRI  NETS 

Petri  net  tneory  plays  an  Important  part  in  this 
research  endeavor  for  two  reasons.  First,  as  nas  been 
mentioned,  Petri  net  tneory  forms  the  hasis  for  the  concepts 
used  to  describe  and  define  data  flow  computation.  Second, 
Petri  net  theory  Is  the  basis  for  the  Requester-Server 
methodology  that  is  used  in  this  thesis  researcn  as  a 
computer  performance  prediction  tool.  Because  of  its 
applicability,  Petri  net  theory  shall  be  presented  herein, 
in  an  infor-nal  manner,  with  empnasis  placed  on  its  use  in 
modelling  parallel  computation.  Those  desiring  a  more  formal 
and  complete  discussion  of  Petri  nets  are  referred  to 
[PETERSON.  1977], 

Petri  nets  may  be  thought  of  as  formal,  abstract  models 
of  information  flow.  Their  main  use  has  been  in  tne 
modelling  of  systems  of  events  in  which  some  events  may 
occur  concurrently  but  there  exist  constraints  on  the 
frequency,  precedence  and  concurrence  of  these  events.  A 
Petri  net  graph  models  the  static  structure  of  a  system.  The 
dynamic  properties  of  a  system  can  be  represented  by 
"executing"  the  Petri  net  in  response  to  the  flow  of 
information  (or  occurrence  of  events)  in  the  system. 
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The  static  grapn  of  a  Petri  net  is  made  up  of  two  types 
of  nodes:  circles  (called  places)  wnicft  represent 
conditions,  and  bars  (called  transitions)  which  represent 
events.  These  nodes  are  connected  by  directed  arcs  running 
from  either  places  to  transitions  or  transitions  to  places. 
The  source  of  a  directed  arc  is  the  input,  and  the  terminal 
node  is  the  output.  The  position  of  information  in  a  net  is 
represented  by  markers  called  tokens. 

Tne  dynamic  execution  of  a  Petri  net  is  controlled  by 
the  position  and  movement  of  the  tokens.  A  token  moves  as  a 
result  of  a  transition  firing.  In  order  for  a  transition  to 
firet  it  must  be  enabled.  A  transition  is  enabled  when  all 
of  the  places  which  are  inputs  to  a  transition  are  marked 
with  a  token.  Upon  transition  firing,  a  token  is  removed 
from  each  of  the  input  places  and  a  token  is  placed  on  each 
of  tne  output  places  of  tne  transition.  Tnus,  in  modelling 
tlie  dynamic  behavior  of  a  system,  the  occurrence  of  an  event 
is  represented  by  tne  firing  of  the  corresponding 
transition. 

Figures  II.B.l  through  11.5.4  show  a  Petri  net  at 
progressive  stages  of  execution.  As  can  be  observed,  the 
status  of  the  execution  at  a  given  time  can  be  described  by 
the  distribution  of  the  tokens  in  the  net.  Tnis  distribution 
of  tokens  in  a  Petri  net  is  called  the  net  marking  and 
uniquely  defines  tne  state  of  tne  net  for  any  given  instant. 
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Petri  nets  are  uninterpreted  models.  Thus,  some 
significance  must  be  attached  to  token  movement  to  indicate 
the  intent  of  the  model.  This  is  usually  done  by  labelling 
the  nodes  of  a  net  to  correspond  in  some  way  to  the  system 
being  -nodelled.  However,  it  should  be  remembered  that  the 
labelling  of  the  nodes  of  a  Petri  net  in  no  way  affects  its 
execution,  k  second  attribute  of  Petri  nets  is  tneir  ability 
to  model  a  system  hierarchically.  An  entire  net  may  be 
replaced  by  a  single  node  (place  or  transition)  for 
modelling  at  a  greater  level  of  abstraction  or,  conversely, 
a  single  node  may  be  replaced  by  a  subnet  to  show  greater 
detail  in  the  model. 

Petri  nets,  as  a  formal  graph  model,  are  especially 
useful  in  modelline  the  flow  of  information  and  control  in 
systems  which  can  be  characterized  by  asynchronous  and 
concurrent  behavior.  Figure  II. B. 5  shows  the  Initial  marKin? 
of  a  Petri  net  model  of  such  a  system.  Initially,  transition 
El  is  enabled  because  each  of  its  input  places,  CI  and  CH, 
is  marfced  with  a  token.  Firing  transition  El  removes  one 
toten  each  from  places  CI  and  C2,  and  puts  a  token  into  each 
output  place,  C3  and  Ci,  At  this  point  in  the  net  execution, 
transition  E3  Is  disabled  because  one  of  its  input  places, 
C5,  still  has  no  token.  Transition  E2,  nowever,  is  enabled, 
and  upon  firing  causes  a  token  to  be  removed  from  place  C3 
and  one  deposited  in  place  C5.  As  an  aside,  this  portion  of 
the  model   could   correspond   to   a   system   sequencing 
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constraintt  tnat  of  event  E3  navlng  to  wait  until  event  E2 
coinpletes.  Upon  firini?  transition  E3,  places  C6  and  C7 
become  martced  witft  totens  as  places  C4  and  C5  lose  a  tofe'en 
each.  Transitions  E4  and  E5  are  now  enabled  and  can  fire 
simultaneously,  the  occurrence  of  which  corresponds  to 
concurrent  events  in  a  modelled  system.  Doing  so,  tnat  is, 
firing  transitions  £4  and  E5,  brings  the  Petri  net  model 
bacif  to  its  original  (initial)  configuration. 

One  other  situation  that  can  be  represented  usin«r  Petri 
nets  is  that  of  conflict.  Figure  II. B. 6  shows  a  net  model  of 
such  a  situation.  Simply,  transitions  El  and  E2  are  both 
enabled.  However,  if  either  transition  fires,  the  remaining 
transition  becomes  disabled.  In  such  a  case,  it  is  an 
arbitrary  decision  as  to  which  one  fires.  Because  we  would 
llfce  to  be  able  to  duplicate  experiments  and  obtain  the  same 
results,  a  scheme  that  is  often  used  involves  simply 
assigning  priorities  to  transitions  which  are  subject  to 
conflict  in  a  net.  In  this  way  reproducible  results  can  be 
ensured.  If  true  nondeterminism  is  desired  in  such  a  model, 
a  scheme  in  which  probabilities  are  associated  with  each 
transition  can  effectively  model  nondeterminism  in  tne 
system  under  study. 

Thus,  to  properly  model  a  system  with  Petri  nets,  every 
sequence  of  events  in  the  modelled  system  should  be  possible 
in  the  Petri  net  and  every  sequence  of  events  in  the  Petri 
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Figure    II.B.l:      MARKED    PETRI    NET,    TIME=0 
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Figure    II. B. 2:      MARKED    PETRI    NET,    TIME=1 
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Figure    II.  B.  3:      MARKED    PETRI    NET,    TIME=2 
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Figure    II. B. 4:      MARKED    PETRI    NET,    TIME=3 
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Figure  II. B. 5:   PETRI  NET  GRAPH  MODELLING 

ASYNCHRONOUS,  CONCURRENT  BEHAVIOR 
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Figure  II. B. 6 


PETRI  NET  GRAPH  MODELLING 
CONFLICT 


25 


net  should  represent  a  possible   sequence   in   the  modelled 
system. 

This  section  has  introduced  Petri  nets  and  demonstrated 
tneir  usefulness  in  fortially  modelling  information  and 
control  flow  in  systems  characterized  by  asynchronous  and 
concurrent  behavior.  Readers  interested  in  the  use  of  Petri 
nets  for  performance  evaluation  of  sucn  systems  are  referred 
to  [RAMAMOORTHT ,  1980]  and  [RAMCHANDANI ,  1974].  The 
following  section  snail  introduce  readers  to  the  concept  of 
data  flow  which,  as  mentioned  previously,  is  a  direct 
application  of  Petri  net  theory. 

C.   CONCEPT  OF  DATA  FLOV 

Data  flow  computing  is  a  method  of  multiprocessing  which 
proposes  to  exploit  inherent  algorithmic  parallelism  at  all 
hierarchical  levels  within  a  program.  Additional  objectives 
include  effectively  using  the  capabilities  of  LSI  technoloey 
and  simplifying  the  programming  tasi.  The  concept  of 
computation  under  data  flow  was  derived  by  Dennis  [DENNIS, 
1974]  (and  a  number  of  others  wording  independently), 
predominantly  from  Karp  and  Miller's  [KARP,  1966]  worft  on 
computation  graphs.  This  section  begins  by  presenting  the 
data  flow  concept  from  the  perspective  of  language,  rather 
than  that  of  architecture.  This  approach  is  appropriate  in 
view  of  the  fact  that  data  flow  computer  systems  are  heine 
designed  as  hardware  interpreters  for  a  base   language   that 
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is  fundamentally  different  from  conventional  laneua^es.  A 
hardware  description  of  tne  Dennis-Misunas  data  flow 
architecture  design  completes  tnis  section. 

In  a  data  flow  computer,  an  Instruction  is  executed  as 
soon  as  its  operands  become  available.  No  notion  of  separate 
control  flow  exists  because  the  data  dependencies  define  the 
flow  of  control  in  a  data  flow  program.  In  fact,  data  flow 
computers  have  no  need  for  a  program  location  counter. 

This  contrasts  with  the  traditional  "von  Neumann" 
conputer  architecture  model  which  uses  a  global  memory  whose 
state  is  altered  by  the  sequential  execution  of 
instructions.  Such  a  model  is  limited  by  a  "bottleneck" 
between  the  computer  control  unit  and  the  Global  memory 
[BACKUS,  1978].  This  "feature"  allows  conventional  languages 
to  have  side-effects,  a  common  example  of  which  is  tne 
ability  of  a  procedure  to  modify  variables  in  the  calling 
program.  Such  side-effects  are  prohibited  under  tne  data 
flow  concept.  Furthermore,  in  data  flow,  no  variables  exist, 
nor  are  there  any  scope  or  substitution  rules.  In  fact,  the 
data  flow  concept  prohibits  the  modification  of  anything 
that  has  a  value.  Rather,  data  flow  computing  tafces  inputs 
(operands)  and  generates  outputs  (results)  that  have  not 
previously  been  defined.  Thus,  instructions  in  data  flow  are 
pure  functions.  This  is  necessary  so  that  instruction 
execution  can  be  based  solely  on  the  availability  of  data 
(operands).  Thus  the  data  dependencies   must   be  equivalent 
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to,  and  in  fact  define,  tne  sequencing  constraints  in  a 
program.  Also,  to  exploit  parallelism  at  all  levels,  it  must 
be  possible  to  derive  tnese  data  dependencies  from  tne  nigh 
level  lansuaffe  program  instructions  [ACKERMAN,  1979J . 

A  language  waicn  allows  processing  by  means  of  operators 
applied  to  values  is  called  an  applicative  language.  VAL 
(Value-oriented  Algoritamic  Language)  is  a  nigh  level  data 
flow  applicative  language  under  development  at  MIT  [ACKERMiN 
and  DENNIS,  1979].  It  prevents  any  side-effects  by  requiring 
programmers  to  write  expressions  and  functions;  statements 
and  subroutines  are  not  allowed  in  the  language.  Because  of 
this  constraint,  most  concurrency  is  apparent  in  a  hieh 
level  language  program  written  in  VAL.  For  tne  purposes  of 
this  research,  no  further  understanding  of  the  high  level 
language  of  data  flow  is  required.  Information  about  high 
level  language  alternatives  is  available  in  [McGRAW,  1980J  , 
[ACKERMAN  and  DENNIS,  1979],  and  [ACKERMAN,  1979J . 

At  What  would  correspond  to  the  assembly  language  level, 
a  data  flow  computation  can  be  represented  as  a  graph.  Ttie 
nodes  of  the  graph  correspond  to  operators  and  the  arcs 
represent  data  paths.  An  arc  into  a  node  represents  an  input 
operand  path?  an  arc  leaving  a  node  corresponds  to  a  result 
path.  Data  flow  graph  execution  occurs  as  operands  become 
available  at  each  node.  Vhen  the  input  arcs  of  a  node  each 
have  a  value  on  them,  the  node  can  execute  by  removing  those 
values,   conputing  the  operation,  and  placing  the  results  on 
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Figure  II.C.l:   DATA  FLOW  NODE  EXECUTION 
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funotlon      Stats   (X.Y.Z:   re»l     return!  real,   real) 
let 

Mean   real   :=    (X    +   Y   +    Z)   /   3; 

SD   real   :=   SQRT(   (X^   +   Y^   +   Z' 
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Mean    ,    SD 
endlet 
endfua 


)  /  3  -  Mean^  ); 


Figure  II. C. 2;   A  SIMPLE  STATISTICS 
FUNCTION  AND  ITS  DATA  FLOW  GRAPH 
[McGRAW,  1980] 
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t&e  output  arcs  (see  Figure  II.C.l).  The  example  data  flow 
^raph  in  figure  II. C. 2  computes  tne  mean  and  standard 
deviation  of  its  three  input  parameters. 

Such  a  graph  notation  is  useful  in  illustrating  tne 
various  levels  of  parallelism  in  a  program.  For  example,  a 
grapn  node  may  represent  a  simple  operator  such  as  addition, 
or  the  entire  statistics  function  of  figure  II. C. 2.  Tnus  tne 
data  flow  graph  notation  can  represent  parallelism  existing 
at  the  operator,  function  and  even  computation  level.  The 
graphs  execute  asynchronously,  nodes  firing  when  data  inputs 
are  available.  Thus  no  syncnronization  problem  exists  witn 
regard  to  accessing  shared  data.  Each  data  flow  path  can  be 
marfced  with  a  value  by  only  one  operator  node.  Once  a  value 
is  on  a  path,  no  operator  'can  modify  that  value.  The  value 
can  only  te  read  when  used  as  an  input  to  another  node 
[McGRAW,  19S0] . 

Again,  the  data  flow  graph  notation  merely  allows  a 
logical  representation  of  a  program  at  a  level  corresponding 
to  conventional  assembly  language.  This  logical 
representation  snail  now  be  extended  to  permit  tne  reader  to 
understand  the  basic  data  flow  hardware  instruction 
execution  mechanism.  A  simple  example  computation  that  shall 
be  used  to  facilitate  reader  understanding  is  the  following: 

Z=(X+I)'i*(X-T) 
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A  SIMPLE  DATA  FLOW  PROGRAM 
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Figure  11.0.4: 


AN  ACTIVITY  TEMPLATE  FOR 
THE  ADDITION  OPERATOR 
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Figure  II. C. 5:  PROGRAM  GRAPH  USING  ACTIVITY 
TEMPLATES  FOR  THE  DATA  FLOW  PROGRAM  GRAPH  OF 
Figure  II.  C.  3   [DENNIS,  19  8(3 
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Tne  grapn  representation  of  tiiis  computation  is  snown  in 
figure  II. C. 3. 

In  tne  extended  grapn  representation  scnemCt  a  data  flow 
program  exists  as  a  collection  of  activity  templates,  each 
tenplate  corresponding  to  a  node  In  tne  data  flow  program 
ffraph.  For  example,  figure  II. C. 4  s&ows  an  activity  template 
for  ttie  addition  operator.  T&ere  are  four  fields  in  tie 
activity  template.  Tne  first  field  denotes  tne  operation 
code  wnicH  specifies  the  operation  to  he  performed.  The 
second  and  third  fields  are  receivers,  which  are  locations 
waiting  to  receive  operand  values.  The  fourth  field  is  a 
destination  field  which  specifies  where  the  result  of  the 
operation  on  the  operands  is  to  go.  There  can  be  multiple 
destination  fields.  Figure  II. C. 5  shows  the  program  graph 
representation  of  figure  II. C. 3,  using  activity  templates. 

Activity  templates  have  been  developed  which  control  the 
routing  of  data  for  such  program  structures  as  conditionals 
and  iterations.  These  templates  are  mentioned  to  point  out 
the  fact  that  graph  nodes  can  represent  not  only  simple 
operands  but  can  also  represent  more  elegant  and  necessary 
constructs. 

Some  definitions  which  are  necessary  to  tne 
understanding  of  the  data  flow  instruction  execution 
mechanism,  follow.  First,  a  data  flow  program  instruction  is 
the  fixed  portion  of  an  activity  template  and  is  made  up  of 
the  opcode  and  tne  destinations. 
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instruction: 

<opcode,  (lestinations> 

Each  destination  field  provides  tne  address  of  some  activity- 
template  and  an  input  (or  offset)  denoting  wnicn  receiver  of 
the  template  is  the  target. 

destination : 

<address,  input> 

Data  flow  program  execution  occurs  as  follows.  The 
fields  of  a  template  which  has  heen  activated  (by  tbe 
arrival  of  an  operand  value  at  each  receiver)  form  an 

operation  packet: 

<opcode,  operands,  destinations> 

When  the  operation  pacfeet  nas  been  operated  upon,  a  result 
packet  of  the  form 

result  packet: 

<value,  destination> 

is  generated  for  each  destination  field  of  tne  original 
activity  template.  Result  packet  veneration  triggers  the 
placement  of  the  value  in  the  receiver  designated  by  its 
destination  field.  Thus,  at  a  logical  level,  data  flow 
program  execution  occurs  as  a  consequence  of  operation 
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pactcet  and  result  packet  movements  tnrougn  a  Tiacnlne 
described   in   detail    below. 

The  basic  data  flow  instruction  execution  mecbanism  is 
s&own  in  figure  II. C. 6.  Tne  data  flow  proffram,  consisting  of 
a  collection  of  activity  templates,  is  neld  in  tne  activity 
store  (see  figure  II. C. 6).  Eacti  activity  template  is 
uniquely  addressable  wittiin  tne  activity  store.  Wten  an 
instruction  is  ready  to  be  executed  (i.  e.  the  template  is 
enabled),  this  address  is  entered  in  the  instruction  queue 
unit  (established  as  a  FIFO  buffer). 

The  fetch  unit  is  then  responsible  for:  removing,  one  at 
a  time,  instruction  addresses  from  the  instruction  queue, 
fetching  the  corresponding  activity  template,  forming  an 
operation  pacfeet  based  on  the  field  values  in  the  template, 
and  submitting  the  operation  paclcet  to  an  operation  unit  for 
processing.  The  operation  unit  processes  the  operation 
pacJcet  by:  performing  the  operation  specified  by  the  opcode 
on  the  operands,  formins  result  pacK-ets  (one  for  each 
destination  field  of  the  operation  paclcet),  and  transmitting 
the  result  pacfcets  to  the  update  unit.  The  update  unit  fills 
in  the  receivers  of  activity  templates  (designated  by  the 
destination  fields  in  the  result  packets)  with  the 
appropriate  values.  The  update  unit  is  also  responsible  for 
checfeing  the  target  template  to  see  if  it  has  all  receivers 
filled,  thus  enabling  the  template.  If  so,  the  address  of 
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the   enabled  template  is  added  at  the  end  of  the  instruction 

queue  by  the  update  unit. 

At  this  point  it  is  appropriate  to  discuss  how  and  where 

program  parallelism  can  be  exploited  by  tnis  nardware. 

"...once  the  fetch  unit  has  sent  an  operation  packet  off 
to  the  operation  unit,  it  .may  immediately  read  another 
entry  from  the  instruction  queue  without  waiting  for  the 
instruction  previously  fetched  to  be  completely  processed. 
Thus  a  continuous  stream  of  operation  packets  may  flow 
from  the  fetch  unit  to  th.e  operation  unit  so  Ion?  as  the 
instruction  queue  is  not  empty. 

"This  mechanism  is  aptly  called  a  circular  pipeline- 
activity  controlled  by  tne  flow  of  information  packets 
traverses  the  ring  of  units  leftwise.  A  number  of  packets 
may  be  flowing  simultaneously  in  different  parts  of  the 
ring  on  behalf  of  different  instructions  in  concurrent 
execution.  Thus  the  ring  operates  as  a  pipeline  system 
with  all  of  its  units  actively  processing  packets  at  ones. 
Tne  degree  of  concurrency  possible  is  limited  by  the 
number  of  units  on  the  ring  and  the  degree  of  pipelining 
within  each  unit.  Additional  concurrency  may  be  exploited 
by  splitting  any  unit  in  tne  ring  into  several  units  whicn 
can  be  allocated  to  concurrent  activities."  [DENNIS, 
NOV1980] 

The    Dennis-Misunas   data   flow   architecture   for 

implementing  the  described  instruction  execution  mechanism 

is   called  the  cell  block  architecture  and  is  illustrated  in 

figure  II. C. 7. 

"The  heart  of  this  arcnitecture  is  a  lar^e  set  of 
instruction  cells,  each  of  which  holds  one  activity 
template  of  a  data  flow  program.  Result  packets  arrive  at 
instruction  cells  from  the  distribution  network.  Each 
instruction  ceil  sends  an  operation  packet  to  the 
arbitration  network  when  all  operands  and  signals  have 
been  received.  The  function  of  the  operation  section  is  to 
execute  instructions  and  to  forward  result  packets  to 
target  instructions  by  way  of  the  distribution  network." 
[DENNIS,  NOV19B0] 
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Figure  II.C.B  reflects  a  practical  form  of  tne  ceil 
blocK  architecture  which  maices  use  of  LSI  technology  and 
reduces  the  number  of  devices  and  interconnections.  This 
practical  form  is  obtainable  by  grouping  the  instruction 
cells  of  figure  II. C. 7  into  blocfcs,  each  of  which  is  a 
single  device.  In  tnis  organization,  several  cell  tlocus  are 
serviced  by  a  group  of  multifunction  processing  elements. 
The  arbitration  networfc  channels  operation  packets  from  cell 
blocJcs  to  processing  elements.  Rather  than  employing  a  set 
of  processing  elements  each  capable  of  a  different  function, 
which  is  one  design  option,  use  of  one  multipurpose 
processing  element  type  is  the  favored  approach.  Such  an 
approach  precludes  the  need  for  tne  arbitration  network  to 
route  operation  packets  according  to  opcode.  Instead,  it 
sinply  has  to  forward  operation  pacicets  to  any  available 
processing  element.  It  is  this  design  which  forms  tne  basis 
for  the  system  model  used  in  this  research  effort. 

How  does  tne  basic  necnanism  relate  to  the  cell  bioci 
architecture?  Figure  II. C. 9  shows  a  cell  block 
implementation.  It  differs  from  the  basic  mechanism  in  two 
ways.  First,  the  cell  bloct  has  no  processing  element(s) 
(operation  unit(s)).  Second,  result  pacicets  targeted  for 
activity  templates  held  in  the  same  cell  block  must  traverse 
the  distribution  network  before  being  handled  by  the  update 
unit  [DENNIS,  NOV  1980J  .  This  is  tne  Dennis-Misunas  data 
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flow  architecture   design.   Other  designs   do   exist;   for 
examples,  see  [GOSTELOW,  1980J  and  [WATSON,  1979J . 

D.   COMPUTER  PERFORMANCE  PREDICTION 

Computer  performance  prediction  is  an  evaluation  process 
which  proposes  to  estimate  the  performance  of  a  system  not 
yet  in  existence  (i.e.  in  some  state  of  design). 
"Performance"  simply  means  how  well  a  system  worlds.  Tnis  in 
turn  connotes  the  concept  of  value.  So,  the  purpose  of 
estimating  the  performance  of  a  system  under  design  is  to 
determine  that  system's  expected  value. 

In   order  to   quantify  now  well  a  system  worlfs  or  shall 

worfc,  performance  metrics  called  indices  are  used.   Typical 

indices  and  their  definitions  are: 

THROUGHPUT  RATE  -      The  volume  of  information  processed 

hy  a  system  in  one  unit  of  time 

HARDWARE  UTILIZATION  -  The  ratio  between  the  time  tne 

hardware  is  used  during  an  interval 
Of  time,  and  the  duration  of  that 
interval  of  time 

RESPONSE  TIME  -        The  elapsed  time  t5etween  tne  sub- 
mission of  a  program  job  to  a  system 
and  completion  of  the  corresponding 
job  output. 

Computer   performance  prediction   can   be  achieved  via 

several  different  techniques.  Each  technique  has  limitations 

and  advantages.  The  technique  utilized   in   this   thesis   is 

that   of  simulation.   The  simulation  technique  involves  the 

representation,  by  a   model,   of   certain  aspects   of   tne 

behavior  of   a   system   in  the  time  domain.  Observing  tnese 
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aspects  of  the  t)ehavior  in  time  of  the  system's  model,  under 
inputs  generated  by  a  model  of  tne  system's  inputs,  produces 
results  useful  in  the  evaluation  of  the  modelled  system 
[FERRARI,  197SJ.  For  tne  purposes  of  this  research,  the 
aspects  of  behavior  that  are  of  interest  are  the  performance 
indices  previously  defined. 

Of  significant  importance  to  any  simulation  effort  are 
the  issues  of  validation  and  parameter  estimation. 
Conceptually,  validation  attempts  to  establish  some  degree 
of  confidence  that  the  simulation  shall  produce  results 
which  Shall  closely  correspond  with  the  performance  of  the 
system  under  scrutiny.  Parameter  estimation  provides  the 
simulation  effort  with  hopefully  credible  parameter  values 
needed  to  perform  a  simulation  having  relevant  results. 
These  issues  shall  be  addressed  in  section  IV. A: 
Experimental  Design. 

The  last  section  of  the  review  of  the  literature 
applicable  to  this  research  endeavor  presents  tne 
Requester-Server  methodoloffy.  The  Requester-Server 
methodology  is  the  "tool"  used  to  perform  the  simulation 
which  generates  the  results  on  which  the  prediction  of  data 
flow  performance  is  based. 

Readers  desiring  a  more  thorough  presentation  of  the 
subject  of  computer  performance  prediction  are  referred  to 
[FERRARI,  1979J,  [COX,  1978],  [ALLEN,  1980J  ,  [HA^^MIN(J, 
1975],  [SPRAGINS,  1980],  [BQZEN,  1980],  and  [SAUER,  1980]. 
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E.   REOOESTER-SERVER  METHODOLOGY 

The  Requester-Server  (R-S)  metHodolOffy  was  desiened  and 
Initially  implemented  by  L.  A.  Cox,  Jr.  [COX,  1978J  . 
Subsequently,  the  Requester-Server  software  was  modified  by 
D.  M.  Stowers  [STOWERS,  1979]  to  run  on  the  PDP-11/50 
minicomputer  at  NPS.  This  section  summarizes  those  portions 
of  [COX,  1978]  and  [STOWERS,  1979]  whicn  are  applicable  to 
and  necessary  for  the  understanding  of  this  research. 

The  R-S  methodology  is  capable  of  predlctinff  tne 
performance  of  computer  systems  characterized  by 
asynchronous,  concurrent  behavior.  The  methodology  can 
predict  performance  at  both  the  computer  system  and  computer 
Job  levels.  The  R-S  methodology  allows  the  user  to 
separately  specify  the  hardware  conf iguration(s )  to  be 
evaluated,  the  software  (programs)  to  be  used  in  evaluating 
the  hardware  conf iffuration(s) ,  and  the  mechanism  or  policy 
for  allocating  hardware  resources  to  program  requests  for 
service.  The  methodology  mates  provision  for  variable  levels 
of  detail  (in  a  hierarchical  sense)  in  both  the  hardware  and 
software.  Finally,  the  R-S  methodology  is  capable  of 
simulatine  concurrency  in  both  the  hardware  and  software. 
Thus,  for  a  given  hardware  configuration,  the  control 
structure  mandated  by  the  software  can  be  mapped  onto  tne 
hardware  and  system  performance  analyzed  and  predicted. 

The  simulation  process  is  begun  by  representing  the 
software  (programs)  and  hardware  as  two  separate  Petri   net 
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grapns.  In  tne  Petri  net  grapn  of  tne  software,  eacn  arc  can 
be  thought  of  as  bavins  an  associated  propagation  delay,  ttie 
extent  of  wnicti  is  dependent  upon  tne  nardware  configuration 
used  to  execute  tiie  program.  If  these  delays  are  definable 
by  their  correlation  to  the  Petri  net  model  of  the  hardware, 
then  perfor'nance  values  for  tne  indices  of  section  II. D 
(Computer  Performance  Prediction)  can  be  obtained  by 
executing  tne  Petri  net  model  of  the  software  on  the  Petri 
net  hardware  conf i«uration(s ) .  The  R-S  "tool"  serves  as  the 
interface  between  the  Petri  net  model  of  system  software  and 
Petri  net  -nodel  of  system  hardware.  This  interface  permits 
the  hardware  and  software  Petri  net  graphs  to  be  constructed 
separately.  Tnis  is  important  because  tne  control  structure 
and  sequencini?  constraints  of  both  hardware  and  software  can 
be  maintained  separately.  This  permits  a  direct  and 
meaningful  representation  of  both  the  system  software  and 
hardware  being  modelled. 

The  source  file  which  serves  as  tne  input  to  tne  R-S 
program  is  organized  into  three  sections.  The  software 
section  of  the  input  file  consists  of  a  description  of  tne 
Petri  net  graph  representing  the  software  program(s)  to  be 
executed.  This  net  graph  description  is  formulated  in  terms 
of  the  functions  and  constraints  of  the  services  required  of 
tne  hardware.  The  hardware  section  of  tne  input  file  is  made 
up  of  a  description  of  the  computer  system  components  and 
their   interconnections.   This    description    can    be 

46 


(nlerarcnically )  at  a  bit-level  or  major  component  level, 
depending  on  t&e  system  aspects  under  scrutiny.  The  Petri 
net  grapn  upon  wnich  tne  nardware  description  is  based  is 
constructed  in  terms  of  its  operation  in  time.  Tne  last 
section  of  t&e  input  file,  called  tne  dynamic  section, 
provides  tne  user  of  tne  R-S  "tool"  a  place  to  denote  system 
Initial  conditions  by  defining  tne  hardware  and  software 
nets'  tofeen  markings  at  tne  beginning  of  a  "run".  A,s  may  be 
recalled  from  section  II. B  (Petri  Nets),  both  the  software 
and  hardware  sections  merely  define  static  Petri  net 
structures.  Performance  prediction  follows  frcn  tne 
attachment  of  significance  to  the  structures  and 
restrictions  on  tofcen  movement  within  these  structures. 

The  dynamic  nature  of  Petri  nets  is  exploited  by  this 
R-S  methodology  as  follows.  The  software  net  representation 
mafees  a  series  of  requests  for  the  services  of  tne  nardtoare 

net  representation.  Repeatedly,  the  R-S  process  maps  these 
requests  for  service  onto  tne  hardware  net  representation. 
At  each  "invocation"  the  R-S  process  "runs"  the  hardware  nel 
to  provide  the  service  requested  by  the  software  net.  Upon 
completion  of  each  of  the  service  requests,  the  R-S  process 
"runs"  the  software  net  representation  until  the  hardware  is 
again  needed.  This  cycle  repeats  itself  until  tne  software 
net  representation  has  been  completely  "run"  and  its 
terminal  state  reached. 
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Events  in  tne  hardware  net  grapn  correspond  to 
operations  in  time.  A  collection  of  events  is  used  to 
represent  each  functional  unit.  Tolfen  movement  through  t^e 
hardware  net  graph  corresponds  to  the  flow  of  data  and 
control  throueh  the  modelled  hardware  system.  A  simple 
hardware  net  description  is  provided  in  figure  lI.E.l. 
Events  in  the  software  net  graph  correspond  to  requests  for 
service.  As  an  example,  an  event  could  equate  to  a  request 
for  a  floating  point  multiplication.  The  flow  of  tokens  in 
the  software  net  srapn  equates  to  the  loeical  flow  of  the 
algorithm,  constrained  by  its  implicit  data  dependencies  or 
sequencing  constraints.  A  simple  software  net  description  is 
provided  in  figure  II. E. 2. 

Together,  tne  software  and  hardware  net  graphs  can  be 
executed  in  such  a  way  as  to  simulate  the  operation  of  the 
computer  system  for  the  given  software  worlcload.  The 
interaction  of  the  two  net  graphs  is  orchestrated  by  the  R-S 
tofcen  arhiter.  Networlr  simulation  bpgins  with  the  marJcing  cf 
the  "begin"  node  of  the  software  net  ^raph.  This  net  graph 
is  then  executed  as  would  be  any  Petri  net  graph.  The 
arrival  of  a  tofeen  at  any  place  in  the  software  net  graph 
indicates  a  request  for  service,  at  which  time  the  R-S  toKen 
arbiter  tafces  control.  (The  type  of  service  requested  is 
denoted  by  the  type  of  the  place  and  is  defined  in  the 
software  net  description.)  The  R-S  tofcen  arbiter  removes  the 
toten  from  the  software  net  and  then   permits   the  software 
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net  ?rap&  to  continue  executing  until  no  furtner  Tioves  are 
possible.  Tne  R-S  toien  arbiter  tnen  initializes  tne 
hardware  functional  unit  (net  grapn  denoted  by  the  type  of 
service  requested)  by  iiarKin^  it  witn  tokens.   Tne  nardware 

net  srraph  is  then  executed  one  step.  Totcens  reaching  events 
corresponding  to  service  completion  are  removed,  and  the 
toten  of  the  software  net  which  originally  caused  the 
request  for  service  is  replaced,  by  the  R-S  tolcen  arbiter. 
Repeating  this  sequence  of  actions  results  in  the  execution 
of  tne  software  net  grapn  by  the  nardware  net  grapn.  A 
sample  input  file  dynamic  section  and  the  results  obtained 
from  executing  tne  software  and  hardware  net  graph 
descriptions  of  figures  lI.E.l  and  II. E. 2  are  presented  in 
figure  II. E. 3.  Those  readers  interested  in  the 
Requester-Server  metflolology  are  referred  to  [COX,  1978J  and 
[STOWERS,.  1979]   for  a   more   in-depth  discussion   of  its 

capabilities  and  usage. 

This  completes  the  necessary  review  of  the  literature 
required  to  understand  the  researcn  tnat  follows.  The 
fundamental  concepts  of  the  various  approaches  to 
parallelism,  Petri  nets,  tne  lata  flow  architecture, 
computer  performance  prediction,  and  the  Requester-Server 
methodology  have  been  reviewed.  The  next  section  presents 
tne  two-part  hypothesis  wnicn  tnis  researcn  addresses  and 
tests. 
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A  SAMPLE  HARDWARE  NET  GRAPH  AND 
INPUT  FILE  DESCRIPTION 


50 


r\ 


00 

Pi 

w 

Eh 

cn 

M 

CD 

w 

p^ 

z 

M 

CD 

S 

H 

K 

.. 

Eh 

s 

>-i 

< 

P:i 

p^ 

W 

CD 

:>  ^^ 

O 

w 

cc; 

w    +     + 

a. 

S  ^o  S 

2: 

M 

< 

CD    11     11    Q 

p^ 

M              S 

Eh 

oa  M  J  w 

Pi; 

o 

U-, 

CJ               CJ 

o 

M 
Eh 

w 

CO 
M 
Pi 
CU 
M 
« 

Eh 
M 

M 

W 


W  LO  un  o 

p^ 

•  ^ 

.«>H  H  w  w  S 

s 

W  Eh  CU  Cu  Oh  O 

o 

hJ        >h  >h  >h  M 

M                             •" 

O^b^b^  B^  ^b^ 

Eh                              W 

S  S                     M 

M                              J 

<  W  Eh  Eh  Fh  C/^ 

00                         CU 

X  >  S  S  S  S 

S                  s 

W  pq  w  w  w  <; 

>  >  >  (^   .« 

<               < 

Pi;                 X 

SSWWWtnS    •-•- 

H           •«      w 

<;  M                              l-H  >^  ^^ 
«CD^'-DQrHCD+   + 

.«   .«Q 

04  M  t-D   S           S 

CDW+   +SEhW'-d2: 

Eh   +    +    W          < 

O  CQ  i-D  S  W  OO  CQ 

CO  ►t)  S                 !^ 

tf;                                       Eh  H 

« #* 

b^    --CD 

PhWWWWWEhDD 

rH 

W  Eh  Eh  r)  CvJ  O 

P^Q^P^P^P^ZDIXtfX, 

^ 

p::  D  D  O,  Eh  Pi; 

00 

<  a,  a.  Eh  00  (X, 

J  s  12:  D 

CDOOOOCJMOO 

Q 

CJ  M  M  O  Q  Q 

w  w  w  w  w  w 

S 

W                    S  IS 

cQ  GQ  G  aa 

W 

Q                    WW 

Figure   II. E. 2: 


A  SAMPLE  SOFTWARE  NET  GRAPH  AND 
INPUT  FILE  DESCRIPTION 
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BEGIN  DYNAMIC  NET; 

MARK  GATE  WITH  1; 

COMMENT:  GATE  ENABLED  TO  ALLOW 
ONLY  ONE  OPERATION  IN 
PROGRESS  AT  ANY  TIME. 

EXECUTE  10; 

COMMENT:   EXECUTE  TEN  HW  CYCLES 

OR   UNTIL   PROGRAM  IS 

COMPLETE. 

END  DYNAMIC  NET; 


S5:   EXECUTE  10; 

*PROGRAM  EVENT  J+K  REQUESTS  HW  SVCS(l) 
*PROGRAM  EVENT  M+J  REQUESTS  HW  SVCS(l) 

TIME  =  1 
TIME  =  2 
TIME  =  3 

TIME  =  4 
TIME  =  5 
TIME  =  6 


^PROGRAM  EVENT  J+K  COMPLETESO) 


•^PROGRAM  EVENT  M+J  C0MPLETES(6) 
36:   END  DYNAMIC  NET; 


Figure  II. E. 3:   A  SAMPLE  INPUT  FILE  DYNAMIC  SECTION 

AND  OUTPUT  FILE  LISTING 
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III.  HYPOTHESIS 

Because  their  exist  several  data  flow  architecture 
proposals,  it  is  desirable  to  have  a  tool  witn  whicn  to 
predict  the  performance  of  the  diverse  designs  for 
comparison  purposes.  Tne  first  part  of  this  research's 
hypothesis  was  that  the  Petri  net-based  Requester-Server 
(R-S)  methodology  is  such  a  tool,  capable  of  predicting  the 
performance  of  data  flow  architectures  in  an  efficient, 
accurate  manner.  In  effect,  the  R-S  tool  was  to  be  tested. 

Tae  second  part  of  tnis  researcn's  nypotnesis  was 
concerned  with  the  Dennis-Misunas  data  flow  architecture 
design.  This  design  was  chosen  for  two  reasons.  First,  there 
existed  adequate  information  In  tne  literature  about  this 
design  on  which  to  base  an  accurate  model  for  simulation 
purposes.  Second,  tne  Dennis-Misunas  design  of  the  basic 
instruction  execution  mechanism  is  essentially  the  same  as 
several  other  schemes  in  various  stages  of  implementation 
[DENNIS,  1979].  The  hypothetical  challenge  to  tnis  design 
was  that  the  goal  of  achieving  higher  speed  computation  is 
not  attainable  unless  a  nign  and  "intelligent"  degree  of 
multiprogramming  is  realized,  as  shall  be  explained  next. 

Obviously,  high  speed  computation  shall  require  a  high 
hardware  utilization.  By  this  it  is  meant  that  most  of  the 
processing  elements   (PE's)   shall   have   to   be   performing 
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useful  woric  most  of  tne  ti^e.  Sucn  a  nign  nardware 
utilization  is  attainable  wnen  eitner  of  two  situations 
occurs.  First,  a  nign  nardware  utilization  will  result  when 
a  process  possessin?  a  laree  amount  of  inherent  parallelism 
is  being  run  (by  itself)  on  the  machine.  In  this  case,  a 
program's  execution  time  is  dependent  upon  its  amount  of 
inherent  parallelism  and  the  number  of  PE's  in  the  machine. 
Second,  a  high  hardware  utilization  is  attainable  when  a 
multiprogramming  environment  (in  which  several  processes  are 
permitted  to  simultaneously  run  on  the  machine)  is 
instituted.  In  such  a  multiprogramming  environment,  an 
individual  process  shall  be  competing  for  hardware  resources 
(PE's).  Thus,  that  process'  execution  time  may  be  lengthy 
regardless  of  its  amount  of  inherent  parallelism.  This  is  so 
because  that  process  may  have  the  use  of  only  a  small 
portion   of   the   machine's  resources  (PE's)  at  any  point  in 

time.  Put  another  way,  if  at  any  time  a  process  has  N 
instructions  available  for  execution,  but  there  are  less 
than  N  PE's  available  for  executing  those  instructions  in  a 
parallel  fashion,  then  the  process'  execution  time  shall  be 
lenethened  over  what  it  could  be  if  it  had  sufficient  PE's 
available. 
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In  sucn  a  situation,  a  scne:7ie  may  be  needed  to  implement 
a  policy  wbicb  achieves  two  ol)jectives: 

1,  maintaining  nign  nardware  utilization  and 

2.  providing  an  acceptable  average  response  time  for 
a  user  requiring  a  given  amount  of  processing. 

"Acceptable  average  response  time"  is  construed  to  mean  tnat 
tne  actual  response  time  of  any  particular  program  which 
requires  a  ^iven  amount  of  processinff  shall  not  be 
lengthened  considerably  over  what  it  would  be  if  tne  program 
were  executed  by  itself  on  the  data  flow  macnine.  Thus,  it 
shall  be  desirable  to  minimize  the  affect  of  system  load  on 
an  individual  program's  execution  time.  That  the  second 
objective  should  be  met  even  at  the  expense  of  the  first 
objective  is  a  strong  point  made  by  [KLEINROCK,  1976J  .  By 
merely  mapping  processes  onto  the  data  flow  machine  as  they 
arrive,  it  is  expected  that  objective  #1  shall  be  achieved 
but  at  the  expense  of  objective  #2.  This  situation  was 
expected  to  be  demonstrated  by  this  research. 

The  purpose  of  this  section  has  been  to  "frame"  the 
research  area  by  presenting  the  issues  which  ^ive  rise  to 
the  hypothesis.  The  following  section  presents  the  method 
used  to  test  the  hypothesis,  and  includes  a  discussion  of 
the  assumptions  male  to  facilitate  the  simulation 
experiments,  where  undecided  design  issues  remain. 
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17.  METHOD 

A.   EXPERIMENTAL  DESIGN 

The  experiment  to  allow  prediction  of  data  flow  computer 
performance  involved  executing  sets  of  Petri  net  models  of 
data  flow  proerams  on  Petri  net  models  of  data  flow 
hardware.  The  Requester-Server  (R-S)  program  tool  monitored 
the  data  flow  (model)  programs'  "execution"  and  provided 
data  which  permitted  the  determination  of  the  performance 
indices:  response  time  and  hardware  utilization.  It  is 
important  to  realize  that  the  results  of  this  research 
predict  the  performance  of  a  model  of  a  data  flow  computer- 
not  that  of  an  operating  data  flow  machine  itself.  (Model 
validation  and  parameter  estimation  issues  are  addressed  in 
section  IV. B:  Data  Flow  Hardware  Definition.) 

The  reader  who  is  familiar  with  analytic  modelling 
employing  queueing  theory  may  ast  wny  tnat  technique,  rather 
than  the  simulation  technique,  is  not  used  to  predict  tHe 
performance  of  the  data  flow  design.  The  answer  is  that  the 
analytic  approach  unnecessarily  constrains  the  prediction  by 
requiring  assumptions  to  he  made  ahout  the  software. 
Specifically,  tne  Petri  net  models  of  software  programs  are 
discretely  defined  with  regard  to  the  amount  of  inherent 
parallelism  available  for  exploitation  at  each  time  step  in 
program  execution.  To  model  analytically,  the  variability  of 


56 


inherent  parallelism  available  for  exploitation  must  be 
described  by  probability  distributions  wnicb  nide  tne 
definable  nature  of  the  programs  at  discrete  time  steps. 

For  tnis  experiment,  tne  data  flow  arcnitecture  was  to 
be  modelled  with  several  different  quantities  of  processing 
elements  (PE's).  Tne  sample  data  flow  program  models  were  to 
be  characterized  by  varying  but  definable  amounts  of 
inherent  parallelism  available  for  exploitation.  Each 
(model)  program  was  to  be  separately  run  on  each  (model) 
hardware  configuration.  (Hereafter,  the  word  "model"  shall 
be  omitted  but  assumed  in  referring  to  the  program  and 
hardware  models  used  in  this  experiment.)  Data  was  to  be 
obtained  to  permit  determination  of  the  performance  indices 
(response  time  and  hardware  utilization),  for  each  run,  from 
tne  monitor  function  of  tne  R-S  tool.  After  running  each 
program  separately,  arbitrary  program  mixes  were  to  be  run 
on  each  hardware  configuration  and  the  same  performance 
indices  again  determined.  Finally,  hand-optimized  program 
mixes  were  to  be  run  on  each  hardware  configuration  and  the 
sane  performance  indices  determined  once  again.  By 
evaluating  the  results,  the  hypothesis  was  expected  to  be 
either  supported  or  refuted. 

The  independent  variable  for  this  experiment  was  defined 
to  be  tne  quantity  of  PE's  available  to  tne  hardware  model. 
Because  PE's  are  but  one  resource  demanded  by  a  process  in 
execution,  other  independent  variable  choices   could  nave 
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included  other  resources  such  as:  the  quantity  of  cell 
blocks  available,  the  type  of  distribution  or  arbitration 
networfc  employed,  and/or  the  type  of  PE's  (multipurpose  or 
sets  of  single-purpose  functional  units)  utilized.  Expanding 
the  number  of  independent  variables  increases  significantly 
the  complexity  of  evaluating  the  results.  How  these  issues 
were  resolved  is  explained  in  section  IV. B. 

The  dependent  variables  for  this  experiment  were  the 
parameters  response  time  and  hardware  utilization.  The 
results  of  the  experiment  were  expected  to  provide  data 
which  could  be  plotted  on  graphs.  Curves  plotting  the 
execution  time  of  each  data  flow  program  against  the  number 
of  PE's  would  constitute  one  such  graph.  Others,  and  their 
significance,  are  presented  in  section  V:  Results  and 
Discussion. 

B.   DATA  FLOW  HARDWARE  DEFINITION 

The  Petri  net  models  of  the  data  flow  hardware 
configurations  were  quantified  in  terms  of  their  operation 
In  time.  Such  quantification  required  assigning  tine 
duration  values  to  each  portion  of  the  cell  bloct 
architecture  -nodel  in  sucn  a  way  as  to  closely  model  tne 
hardware.  Doing  so  required  several  assumptions  to  be  made. 
Those  assumptions  shall  be  addressed  individually  so  as  to 
help  substantiate  the  credibility  of  the  resultant  hardware 
models. 
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To  tegln,  the  processing  elements  (PE's)  were  assumed  to 
te  multifunctional,  capable  of  executing  any  instruction 
routed  to  it  in  one  "standard"  instruction  execution  tirre 
unit.  Allowing  tne  PE's  to  be  multifunctional  and 
characterized  by  a  singular  execution  time  simplifies  tne 
modelling  process. 

There  were  at  least  two  other  possibilities  that  could 
be  accommodated  by  expansion  of  the  Petri  net  models.  First, 
each  multifunction  PE  could  be  replaced  by  a  set  of 
single-purpose  PE's,  each  sin*rle-purpose  PE  defined  in  terms 
of  its  particular  instruction  execution  time,  and  capable  of 
executing  concurrently  with  other  PE's  of  the  set.  Second, 
each  PE  could  be  replaced  by  a  subnet  in  which  only  one 
instruction  could  be  executed  in  any  given  time  step,  but 
the  model  would  define  the  execution  time  as  a  function  of 
the  instruction  type. 

The  first  alternate  approach  implies  a  more  complex 
arbitration  network  with  a  conceivably  longer  routing  time. 
The  second  alternative  would  require  additional  net 
complexity.  (However,  this  approach  would  be  a  good 
possibility  for  subsequent  research.)  Because  the  actual 
Implementation  configuration  has  not  been  finalized, 
modelling  the  PE's  as  multifunctional  and  characterized  by  a 
singular  execution  time  was  a  reasonable  path  to  follow. 

The  distribution  network  design  also  has  not  been 
finallized.  For  ease   of   modelling  purposes,   a   crossbar 
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switch  design  capable  of  supporting  simultaneous  transfers 
of  result  pacfcets  to  cell  blocts  was  cnosen  to  be  modelled. 
This  choice  permitted  a  standard  routing  time  to  be 
Characterized  by  the  model.  Otner  networa:  designs, 
especially  packet  routing  networks,  may  "be  preferred  to  the 
crossbar  switch  for  the  ultimate  machine  because  of  their 
lower  cost  and  comparable  performance  in  a  data  flow 
architecture  [DENNIS,  1979J . 

The  Choice  to  model  the  PE's  as  multifunction  units 
precluded  the  need  for  anything  but  a  simple  arbitration 
network.  Such  a  network  would  merely  have  to  route  operation 
packets  to  any  available  PE.  Accordin<?ly ,  in  the  model,  a 
standard  routing  time  for  this  network  was  characterized. 

With  regard  to  the  cell  blocks,  tne  assumption  was  made 
that  sufficient  cell  blocks  were  available  to  hold  all 
portions  of  all  processes  being  run  on  tne  macnine  at  each 
and  every  instant.  Thus,  there  is  no  notion  of  paeine 
portions  of  processes  into  and  out  of  memory  (the  activity 
store  in  the  case  of  tne  data  flow  arcnitecture) .  This 
assumption  carries  with  it  the  assumption  that  all  program 
compilation  (resulting  in  extended  data  flow  graph-like 
representations)  is  complete  before  beginning  program 
execution.  Other  compilation  strategies  are  under 
consideration,  such  as  requiring  the  user  to  interact  with 
the  system  to  achieve  a  high  degree  of  parallelism 
exploitation  [McGRAW,  19S0J . 
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As  has  been  described,  other  hardware  choices 
(representine  Independent  variables  in  an  experiment)  can  he 
made  and  easily  Implemented  by  simply  defining  appropriate 
subnets  wnlcn,  in  a  time-wise  fasnion,  cnaracterlze  tne 
portions  of  the  hardware  under  scrutiny.  The  approach  tasen 
In  tftis  researcn  permitted  tne  nardware  timing 
characteristics  to  be  a  function  of  simply  tne  number  of 
PE's.  Figure  IV.B.l  is  the  Petri  net  representation  (of  the 
cell  blocic  architecture  nardware)  utilized  in  tnis  researcn. 
For  the  purposes  of  this  experiment  and  in  tne  conf ifi-uration 
described,  each  PE  was  assumed  to  be  driven  at  the  rate  of 
two  million  floating  point  operations  per  second  (FLOPs),  a 
rate  claimed  to  be  reasonable  by  [DENNIS,  1980J .  This  figure 
represents  an  instruction  execution  time  of  500  nanoseconds 
(nsec).  (This  is  represented  in  the  hardware  model  by 
signifying  a  scaling  of  eacn  event /transition  pair  to  equal 
100  nsec.)  Associating  timing  characteristics  wl tn  each 
component  in  the  data  flow  architecture  design  results  in  a 
similar  figure  as  snown  in  figure  IV. B. 2. 

PB  (instruction  execution)  50  nsec 

CELL  BLOCK  (memory  fetcn  assuming  MOS  tecnnology)  250  nsec 

DISTRIBOTION  NET*fORK  (assuming  crossbar  switch)  250  nsec 
ARBITRATION  NETWORK  (assuming  negligible) 

[WEITZMAN,  1980].  TOTAL:  550  nsec 


FIGURE  IV. B. 2:   TIMING  CHARACTERISTICS  OF  DATA  FLOW 

ARCHITECTURE  COMPONENTS 
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Figure  IV.B.l 


PETRI  NET  REPRESENTATION  OF  THE 
CELL  BLOCK  ARCHITECTURE  HARDWARE 
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For  the  purposes  of  tnis  research,  tne  quantity  of  PE's  in 
tne  nardware  was  varied  from  one  to  sixteen,  by  multiples  of 
two.  This  resulted  in  data  s-enerated  for  tne  following 
quantities  of  modelled  PE's:  1,  2,  4,  8,  and  16. 

This  summarizes  tne  assumptions  utilized  in  developing 
the  model  presented  here.  The  following  section  describes 
the  Petri  net  definition  of  the  data  flow  software-  program 
models  which  were  "executed"  on  the  hardware  models. 

C.   DATA  FLOW  SOFTVARE  DEFINITION 

The  Petri  net  models  of  data  flow  programs  were 
quantified  in  terms  of  the  amount  of  Inherent  parallelism 
available  for  exploitation  at  each  discrete  time  step  as 
well  as  in  terms  of  the  implicit  data  dependencies  of  the 
programs.  (As  previously  mentioned,  the  data  dependencies 
define  the  control  flow  of  a  program.)  Tne  initial  approach 
involved  talrine  sample  programs  written  in  the  high  level 
language  (nil)  VAL  and  converting  them  to  their  equivalent 
Petri  net  representations  for  subsequent  "execution"  on  tne 
data  flow  hardware  models.  The  problem  with  this  approach 
was  that  the  compilation  process  is  not  yet  developed.  Tnus, 
what  hardware  instructions  would  be  required  for  each  hll 
instruction  were  not  determinable. 

The  subsequent  approach,  which  was  utilized,  involved 
designing  Petri  net  program  models  characterized  by  various 
but   discretely  definable   levels   of  inherent  parallelism. 
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Executing  sucb  artificial  programs  conceivably  produced  more 
informative  results  than  would  nave  been  obtained  witn  a  few 
select  programs  which  may  have  only  demonstrated  data  flow's 
suitability  for  those  special  purpose  computations.  The 
individual  programs  shall  be  characterized  after  introducing 
a  new  concept. 

A  new  concept  introduced  at  this  point  is  that  of  a 
software  "concurrency  vector",  k  concurrency  vector  is  a 
tuple,  each  entry  of  which  defines  the  amount  of  inherent 
parallelism  in  a  program  at  the  operation  paclcet 
hierarchical  level,  at  a  discrete  instruction  execution  time 
step.  Each  entry  of  the  tuple  is  implicitly  subscripted  by 
the  time  step  it  describes.  For  example,  the  simple 
statistics  function  of  section  II. C  (see  figure  II. C. 2,  paee 
32)  would  he  characterized  by  the  concurrency  vector: 
(4,2,2,2,1,1).  In  this  example  concurrency  vector,  the  "4" 
represents  the  fact  that  the  four  operations  "+",  "SQ", 
"so",  and  "SO"  could  be  processed  in  parallel  during  the 
first  time  step  of  execution  of  the  simple  statistics 
function.  This  is  so  because  no  sequencing  constraints  exist 
among  these  four  operations.  Thus  the  concurrency  vector 
defines  how  many  operation  packets  could  be  parallel 
processed  if  all  the  instructions  (i.e.  functions-  addition, 
subtraction,  division,  square,  square  root)  were  implemented 
in  hardware.  (If  they  were  not,  the  subfunctionai  operation 
pacfcets   required   by  the  instruction  would  be  considered  in 
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defining  tne  concurrency  vector  entries.)  It  snould  also  be 
recognized  ttiat  tiie  concurrency  vector,  thoueb  a  function  of 
a  progra-n,  is  dependent  upon  a  standard  instruction 
execution  time  duration.  If  the  hardware  is  implemented  sucn 
that  execution  time  is  a  function  of  the  instruction  type, 
then  the  concurrency  vector  entries  could  he  described  at  an 
even  lower  level  than  the  operation  pacicet  level.  Such  a 
level  would  correspond  to  a  basic  hardware  cycle  time,  where 
executing  an  instruction  would  require  some  number  greater 
than  one  hardware  cycles  to  complete.  This  additional 
complexity  need  not  be  considered  in  this  research  in  view 
of  the  hardware  design  approach  taken,  but  could  be 
accommodated  by  the  R-S  methodology  used  here. 

Four  programs  i" k"  through  "d")  were  utilized  in  this 
research.  These  programs  are  diff erentiable  by  their  length 
as  well  as  by  the  amount  of  inherent  parallelism  available 
for  exploitation  at  each  time  step.  The  Petri  net 
representations  of  these  programs  are  shown  in  figures 
IV.C.l  through  17. C. 4.  Additionally,  tne  concurrency  vector 
for  each  is  shown.  The  program  mixes  for  this  experiment 
included  one  of  each  of  the  three  programs,  "a",  "B"  and 
"d".  Another  program  mix  including  one  of  each  of  the  four 
programs,  "a"  through  "d",  was  also  used. 

The  following  section  presents  the  procedure  utilized  in 
executing  the  experiment.  Additionally,  the  method  of 
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FIGURE    IV.C.l: 


(MODEL)    PROGRAM    "A" 
CV:     (4,3,2,1,2,3,4) 
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FIGURE    IV. C. 2:       (MODEL)    PROGRAM    "B" 

CV:     (8,8,8,8) 
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FIGURE    IV. C. 3 


(MODEL)    PROGRAM    "C" 

CV:     (6,6,5,5,4,4,3,3,2,2) 
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FIGURE    IV. C. 4:       (MODEL)    PROGRAM    "D" 

CV:     (3,5,7,2,1) 
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mapping   tne  program  mixes  onto  eacn  of   tne  nardvare 
configurations  is  explained. 

D.   PROCEDURE/IMPLEMENTATION 

The  four  procedural  steps  utilized  in  executing  tnis 
experiment  were  as  follows.  First,  Petri  net  models  of  botn 
tbe  hardware  configurations  (figure  I7.B.1)  and  software 
programs  (figures  IV.C.1-.4)  were  converted  to  a  format 
acceptable  as  input  to  the  Requester-Server  (R-S)  program. 
Two  Pascal  programs,  compiled  and  executed  on  the  NPS 
"B-side"  PDP-11  (a  UNIX-based  system),  facilitated  the 
(separate)  generation  of  the  hardware  and  software  portions 
of  the  input  files  for  the  R-S  program.  Eacn  input  file  was 
formed  by  concatenating  the  hardware  and  software  portions 
and  then  editing  the  resulting  file  to  define  the  dynamic 
execution  desired.  The  second  step  was  to  transfer  each 
complete  input  file  from  tne  NPS  "B-side"  to  the  NFS 
"A-side"  PDP-11  (an  RSX-llM-based  system),  via  an 
inter-processor  lini:.  Thirdly,  the  R-S  program  was  run  on 
the  "A-side",  tatlng  as  input  the  file  wnich  defined  the 
hardware,  software,  and  dynamic  execution  desired.  Fourth 
and  finally,  data  regarding  the  execution  of  the  software  on 
the  hardware  was  obtained  from  the  output  file  generated  by 
the  R-S  program.  The  results  of  the  data  analysis  are 
presented   and   discussed   in   section  7   (Results  and 

« 

Discussion) . 
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Tne  Implementation  portion  of  tnis  section  addresses  tne 
technique  used  to  map  the  software  onto  the  hardware  in  such 
a  way  as  to  effectively  simulate  this  function  as  it  might 
be  done  on  a  real  data  flow  machine.  The  first  set  of 
experimental  "runs",  which  consisted  of  the  separate  running 
of  each  program  on  each  hardware  configuration,  was 
straightforward  in  implementation.  The  procedure  described 
above,  in  which  each  program  file  portion  was  concatenated 
with  the  appropriate  hardware  file  portion,  achieved  a 
relevant  mapping  for  modelling  a  single  process  running  on  a 
particular  hardware  configuration.  The  subsequent  set  of 
experimental  "runs",  in  which  a  program  mix  was  mapped  onto 
each  of  the  hardware  conf isuratlons ,  was  not  so 
straightforward  in  implementing  as  shall  be  explained  next. 

To  understand  the  mapping  of  software  (i,  e.  processes) 
onto  data  flow  hardware,  it  is  helpful  to  scrutinize  the 
functions  of  the  operating  system  for  sucn  a  macnine. 
Because  the  scheduling  and  synchronization  of  concurrent 
activities  are  built  in  at  the  hardware  level,  a  data  flow 
machine's  operating  system  will  only  be  responsible  for 
initialization,  termination,  and  input/output  (I/O)  of 
processes.  Once  a  process  is  mapped  onto  the  data  flow 
machine,  it  runs  to  completion  without  further  intervention 
by  tne  operating  system  (except  for  I/O).  The  question  which 
must  be  answered  is:  When  should  another  process  te  mapped 
onto  a  machine  which  is  already  executing  one  or  more 
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processes?  Thus,  in  defining  the  input  file  of  a  program  mix 
representative  of  ready  processes,  the  program  mix  had  to  be 
defined  in  terms  of  a  mapping  function. 

THe  mapping  functions  can  be  thought  of  as  operating 
system  assignment  policies.  Thus,  for  those  runs  Involving 
program  mixes  (as  opposed  to  single  programs),  an  assignment 
policy  had  to  be  simulated.  One  aspect  of  this  research  then 
can  te  viewed  as  an  investigation  of  different  policies  for 
mapping  processes  onto  data  flow  machines  in  a 
multiprogramming  environment. 

Each  program  mix  consisted  of  the  three  programs  "a", 
"b"  and  "d".  (A  later  "run"  for  which  data  was  gathered 
utilized  the  four  program  mix  consisting  of  one  each  of  the 
programs  "a"  through  "d".)  Each  mix  was  varied  in  the  way  in 
which  It  was  mapped  onto  the  hardware,  in  simulating 
different  operating  system  mapping  functions.  The  operating 
system  assignment  policies  for  mapping  a  program  mix  onto 
the  hardware  configurations  follow.  Three  policies  were 
sinulated.  First,  the  tnree  programs  were  permitted  to  begin 
"execution"  at  the  same  time.  Second,  an  "80%  Rule"  was 
simulated  in  which  an  additional  program  was  permitted  to 
begin  "execution"  whenever  the  hardware  utilization  dropped 
below  80%.  Third,  an  "intelligent"  assignment  policy  was 
Implemented  via  a  mapping  function  based  on  the  programs' 
concurrency  vectors.  This  assignment  policy,  it  was 
envisioned,   would  cause  optimal  performance  in  terms  of  the 
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performance  indices:  response  time  and  Hardware  utilization. 
The  concurrency  vector  approach  optimizes  tne  assignment  of 
processes  onto  the  macnine  by  fitting  together  concurrency 
vectors  of  ready  processes  in  such  a  way  that  the  objectives 
noted  in  section  III  are  acnieved.  For  example,  given  a 
machine  with  eiffht  PE's,  the  concurrency  vectors  would  be 
fitted  as  saown  in  figure  IV.D.l. 


JOB 

"A": 

JOB 

"B": 

JOB 

"  _  " 

C  : 

JOB 

D  : 

JOB 

E  : 

TOTiL: 


(4,3,2,1,2,3,4) 
(3,5,7,6,5,2) 
(4,4,4,2,1) 
..8,4,4) 

(2,7,8,. 

•  ,S,o,8,8,o,o,8,8,S,S,7,3, • 
=  «==TIME=====> 


FIGURE  IV.D.l:   AN  EXAMPLE  OF   FITTING*  CONCURRENCY 

VECTORS  TOGETHER 


By  generating  tne  concurrency  vectors  at  compile  time,  the 
program  can  declare  beforehand  those  resources  (as  a 
function  of  tine)  needed  for  execution  as  well  as  wnen  ttie 
program  will  be  completed.  An  operating  system  can  thus 
choose  the  seo.uencing  of  the  running  of  the  waiting 
processes  to  achieve  the  best  fit  to  best  meet  tne 
objectives  of  section  III. 

The  results  of  these  experimental  runs  are  presented  in 
the  following  section  in  graphical  form.  Additionally,  tbe 
meaning  and  significance  of  the  results  are  discussed. 
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V.   RESaiTS  AND  DISCOSSION 

In  response  to  tne  first  part  of  tnis  researcn's 
hypothesis,  it  is  proposed  that  the  Petri  net-based 
Requester-Server  (R-S)  methodology  is  indeed  a  desirable 
tool  with  which  to  predict  the  performance  of  the  diverse 
designs  of  data  flow  architectures.  The  ability  to 
separately  specify  the  hardware,  software  and  resource 
allocation  policy  was  an  R-S  feature  which  permitted 
efficient  generation  of  the  combinations  of  the  above  three 
ite:ns.  The  ability  to  easily  implement  variable  levels  of 
detail  in  bota  the  hardware  and  software  was  not  exploited 
but  the  method  for  doing  so  was  introduced.  Finally,  tne  R-S 
methodoloffy's  capability  of  simulating  concurrency  and 
asynchronous  benavior  in  both  tne  nardware  and  software  is  a 
necessity  for  accurately  modelling  and  simulating  data  flow 
computing. 

The  results  which  address  the  second  part  of  this 
research's  hypothesis  are  now  presented.  To  begin,  figure 
V.l  shows  individual  program  execution  times  as  a  function 
of  the  number  of  processing  elements  (PE's).  These  absolute 
execution  times  were  used  as  a  basis  for  comparison  with  the 
results  from  the  multiprogramming  environment  runs.  Percent 
hardware  utilization  is  displayed  adjacent  to  each  data 
Doint.   The   hardware  utilization  values  are  averages  of  tne 
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hardware  utilizations  at  each  tine  step  during  execution. 
Tne  grapn  snows  tnat  a  program's  execution  ti-ne  can  be 
drastically  reduced  by  increasing  the  number  of  processing 
resources  (PE's)  available  up  to  tne  point  wnere  execution 
tine  is  bounded  by  the  amount  of  inherent  parallelism 
available  for  exploitation  in  tne  program. 

The  followine  results  pertain  to  the  running  of  the 
program  mixes  in  a  simulated  multiprogramming  environment. 
Initially,  it  was  intended  tiiat  program  mixes  would  contain 
a  greater  quantity  of  programs  than  were  actually  run.  This 
was  not  acnieved  due  to  time  constraints.  Accordingly,  the 
results  should  be  considered  preliminary  in  nature.  On  a 
positive  note,  the  results  provide  insight  into  several  data 
flow  operation  issues.  Figure  V.2  provides  the  raw  data  for 
this  research  with  the  exception  of  the  data  utilized  in 
computing  nardware  utilization.  Figures  V.3,  V.4,  and  V.5 
present  the  hardware  utilization  (as  a  function  of  time)  for 
the  4-,  8-,  and  16-PE  configurations.  Similar  graphs  for  the 
1-  and  2-PE  configurations  are  presented  in  figure  V.6  (note 
the  identical  nature). 

The  implications  supported  by  this  data  follow.  Tney  are 
not  definitive  because  of  the  small  quantity  of  programs  and 
program  mixes  in  the  model,  thus,  while  tne  second  part  of 
the  hypothesis  may  not  have  been  adequately  tested,  the 
metnodology  for  doing  so  appears  to  be  available  in  the  R-S 
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tool.  Additionally,  it  is  maintained  tnat  the  initial 
results  support  tne  discussion  wnicn  follows. 

Whenever  the  amount  of  concurrency  in  all  running 
processes  exceeds  tne  number  of  PE's  available  to  meet  tfie 
retjulrements  of  tne  processes,  some  slowdown  in  execution 
time  results  for  some  processes.  The  dual  of  this  result  is 
that,  so  lone  as  there  are  adequate  PE's  available,  no 
slowdown  in  any  process'  execution  time  results. 

Under  tne  "All  Begin  Together"  (ABT)  assignment  policy, 
the  data  flow  hardware  becomes  "overloaded",  resulting  in 
tfie  slowdown  Just  described.  For  example,  program  "a", 
though  the  first  to  bei?in  processing?  under  the  ABT  scneme, 
is  the  last  to  finish  under  the  three  program  mix.  Under  the 
four  program  mix  wltfi  16  PE's,  the  "c"  program  ta&es  longer 
only  because  of  its  ^reat  length  and  inherent  parallelism. 

Under  tne  "80%  Rule"  assignment  policy,  the  average 
hardware  utilization  is  lower  than  tnat  under  tne  ABT  policy 
(for  the  three  program  mix).  Also,  the  average  of  the  three 
programs'  execution  times  is  lower  tnan  tnat  under  tne  AET 
policy.  For  the  four  program  mix,  the  average  hardware 
utilization  is  sligntly  greater.  This  reflects  a  better 
mapping  of  processes  onto  the  machine.  (Average  nardware 
utilization  is  defined  as  the  average,  over  the  duration  of 
a  run,  of  the  hardware  utilization  percentages  at  eacn  time 
step  of  that  run. ) 


82 


Under  the  optimized  concurrency  vector  (CV)  approach, 
programs  were  mapped  onto  tne  nardware  configurations  in 
such  a  way  as  to  acnieve  a  hi^n  nardware  utilization  at  eacn 
time  step  as  well  as  minimize  the  average  response  time  of 
tne  prograns  in  tne  mix.  Tne  results  indicate  average 
hardware  utilizations  at  least  as  ni^n  as  under  either  of 
tne  otner  assignment  policies.  Also,  tne  average  response 
times  were  at  least  as  low  as  under  either  of  the  other 
assignment  policies.  This  optimized  concurrency  vector 
approach  should  be  suitable  for  machine  optimization.  Using 
concurrency  vectors  generated  at  compile  time,  the  mapping 
of  additional  processes  onto  tne  data  flow  macnine  should 
probably  continue  only  so  long  as  acceptable  average 
response  time  for  any  process  is  not  exceeded.  Wnen  a 
process  characterized  by  more  inherent  parallelism  than  can 
be  currently  accommodated  on  the  data  flow  macnine  is 
awaiting  asslsmment  (i.  e.  mapping  onto  tne  data  flow 
machine),  that  process'  assignment  should  be  delayed  until 
sufficient  (or,  if  necessary,  all)  PE's  are  available  to 
parallel  process  the  computation.  (A  user  advisory  denoting 
such  a  delay  would  be  highly  desirable.) 

The  time  spent  in  preprocessing  Jobs  in  accordance  with 
any  assignment  scheme  to  achieve  some  level  of  optimization 
may  be  unnecessary  or  even  wasteful.  This  trade-off  will 
have  to  te   examined  in  greater  depth. 
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VI.   SOMMAP.Y 

Followlne  a  review  of  the  pertinent  literature,  a 
two-part  bypothesls  was  proposed.  First,  the  Petri  net-based 
Requester-Server  (R-S)  methodolosry 's  suitability  for 
predicting  the  performance  of  data  flow  machines  was  to  be 
tested.  Second,  it  was  Hypothesized  that  tne  goal  of 
economically  achieving  higher  speed  computation  through  data 
flow  computing  would  be  unattainable  without  achieving  a 
high  and  intelligent  degree  of  mul  tiprofframminar.  The  R-S 
methodology,  a  simulation  technique,  permits  the  separate 
specification  of  the  hardware  to  be  evaluated,  the  software 
to  l)e  used  in  the  hardware  evaluation,  and  the  policy  for 
allocatlne  hardware  resources  to  program  requests  for 
service.  Accordingly,  Petri  net  models  of  data  flow  hardware 
configurations  were  quantified  in  terms  of  their  execution 
in  time,  and  Petri  net  models  of  data  flow  programs  were 
quantified  in  terms  of  the  amount  of  inherent  parallelism 
available  for  exploitation  at  each  discrete  time  step,  as 
well  as  in  terms  of  the  implicit  data  dependencies  of  tne 
program.  Model  programs  were  "run"  on  model  hardware 
configurations.  Results  obtained  from  the  monitor  function 
of  tne  R-S  program  were  analyzed,  with  respect  to  tne 
performance  indices:  hardware  utilization  and  response  time. 
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Three   assignment   policies   for   determining  when   to  nap 
additional  proerams  onto  a  data  flow  machine  were  tested: 

1.  all  programs  begin  together 

2.  assign  an  additional  program  whenever  the 
hardware  utilization  drops  below  30^5 

3.  assisrn  an  additional  program  based  on  a 
concurrency  vector. 

Results  show  that  the  R-S  methodology  is  indeed  an 
efficient  and  easy-to-use  tool  for  investigating  data  flow 
architectures.  Also,  initial  results  indicate  tnat  optimized 
scheduling  based  upon  concurrency  vectors  is  viable  for 
deciding  when  to  map  additional  processes  onto  a  data  flow 
machine  to  achieve  the  objectives  of  maintaining  hieh 
fiardware  utilization  and  providing  acceptable  average 
response  time. 
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VII .   RECOMMENDATIONS  FOR  FURTHER  RESEARCH 

With  regarl  to  the  methodology,  wort&wniie  additions  to 
the  R-S  program  would  be  user-f riendiy  "front-"  and 
"bacfc-ends"  wnicti  would  further  simplify  both  the  veneration 
of  input  files  for  tne  R-S  tool  and  tne  retrieval  of  desired 
data  from  the  output  file  erenerated  by  each  run. 

In  the  area  of  data  flow,  simulations  in  which  the 
hardware  definition  of  tne  arcnitecture  was  varied  (as 
described  in  section  17. B)  could  provide  insights  regarding 
tne  optimal  Hardware  configuration  for  tne  expected  program 
load.  In  particular,  the  quantity  of  (modelled)  PE's  siiould 
^e  increased  to  a  number  closer  to  the  amount  expected  in 
the  actual  machine  (approximately  512).  Of  course,  in  order 
to  model  more  accurately,  tne  expected  load  in  terms  of  the 
quantity  of  programs  and  typical  amounts  of  inherent 
parallelism  shall  have  to  be  defined  more  exactly. 

A  final  open  area  within  data  flow  research  is  the 
development  and  testing  of  specific  algoritnms  using 
concurrency  vectors  to  permit  machine  optimization  and 
implementation  of  a  desirable  assignment  policy. 
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